This code seeks to explain how do organoids from patients reflect the generality of OV patients. We do clustering of patients and organoids and see where organoids fall – based on signatures, and also based on raw CN profiles.
To do: Create the new britroc OV exposures using the new code that Ruben has provided and the new absolute copy number segments that he has provided too.
Changes in signatures extraction (from Ruben) - removing a big from one the features: the first segment was not counted, whih is not too important for OV - the pre-processing of CN segments (only applicable to SNP array)
The previous data are:
BriTROC: there are the original BriTROC segments (from manuscript) and new BriTROC segments (called BriTROC 2, here, but it’s not the BriTROC-2 cohort! made by Ruben).
org<- as(readRDS("data/organoid_exposures_Aug21.rds"), 'matrix')
rownames(org) <- paste0('Sample ', 1:nrow(org))
names_orgs = readxl::read_xlsx("data/NewOrganoidNaming.xlsx")
## Creating plot... it might take some time if the data are large. Number of samples: 18
We are loading both the original signatures, adn the updated signatures.
We have two dataframes: with the previous TCGA samples and with the current ones. Both contain the BriTROC and ICGC to this as well (which are shared).
## The percentage of zeros in each cohort is:
## $organoids
## $organoids[[1]]
## [1] "13.492%"
##
##
## $ExposuresNatGen
## $ExposuresNatGen$britroc
## [1] "0%"
##
## $ExposuresNatGen$`OV-AU`
## [1] "0%"
##
## $ExposuresNatGen$`OV-US`
## [1] "0%"
##
## $ExposuresNatGen$TCGA
## [1] "0%"
##
##
## $UpdatedExposures
## $UpdatedExposures$britroc
## [1] "0%"
##
## $UpdatedExposures$`OV-AU`
## [1] "0%"
##
## $UpdatedExposures$`OV-US`
## [1] "0%"
##
## $UpdatedExposures$`Updated TCGA`
## [1] "24.305%"
This makes the organoids and the TCGA exposures sample, and leaves the other in the periphery of the PCA. I suspect this is due to the number of zero exposures, which are imputated using the robust analyses that I am using here:
We are only selecting the updated exposures, now
which_natgen = 'UpdatedSignatures'
For compositional data, in the book Analysing compositional data with R they say that PCA should be done on clr-transformed data. Zeroes are an issue if we use clr using all samples. The robust clr is implemented in the package compositions and deals with this problem by doing the geometric mean over only non-zero values, and setting the clr of a part which is zero to zero.
The plot done with (biplot(princomp(acomp(x)))) is the same as plotting princomp(as(clr(x), ‘matrix’))
## Saving 7 x 5 in image
## Saving 7 x 5 in image
I.e., what type of signatures are not represented in the organoids?
Conclusion: it seems as though it’s signature 3, the relative abundance of which is never high in organoid samples.
I am comparing
the barplots of the exposures
CLR (centered log-ratio) of signature 3 is high in the underrepresented samples
the ratio of the sums of different signatures, e.g. the ratio of 1+3+5 vs 2+4+6+7.
ILR (isometric log-ratio) when splitting the dataset into s3 and all other signatures. It is the log-ratio of the exposure to signature 3 and the geometric mean of all other exposures.
## Creating plot... it might take some time if the data are large. Number of samples: 159
## Creating plot... it might take some time if the data are large. Number of samples: 50
## Creating plot... it might take some time if the data are large. Number of samples: 18
Looking at the loadings. In particular, looking for components in the first and second PC
Respectively, using the first and the second batch of signatures.
Signatures 3 and 6 seem to be quite important for the underrepresented groups
The colour of the labels shows whether there is any zero exposure in the vector of exposures of the sample.
## removed due to infinite values
## quartz_off_screen
## 2
## quartz_off_screen
## 2
## quartz_off_screen
## 2
## Scale for 'y' is already present. Adding another scale for 'y', which will
## replace the existing scale.
## quartz_off_screen
## 2
## quartz_off_screen
## 2
labels(dendro_UpdatedExposures)[grep('Sample', labels(dendro_UpdatedExposures))]
## [1] "Organoid Sample 17" "Organoid Sample 4" "Organoid Sample 11"
## [4] "Organoid Sample 2" "Organoid Sample 18" "Organoid Sample 15"
## [7] "Organoid Sample 12" "Organoid Sample 9" "Organoid Sample 8"
## [10] "Organoid Sample 7" "Organoid Sample 14" "Organoid Sample 1"
## [13] "Organoid Sample 10" "Organoid Sample 16" "Organoid Sample 13"
## [16] "Organoid Sample 3" "Organoid Sample 5" "Organoid Sample 6"
We are looking at the split plot below (i.e. the first split). We call ‘underrepresented’ the samples that fall on the right branch.
## Number of organoids in underrepresented and represented split: 0 18
There are two types of population which are not represented:
These are the exposures for some signatures, in the PCA projection.
To make sure this is not due to the type of signatures we are using (since the array ones have more zeros)
## Fraction of samples with any zero in underrepresented: 0.3404255
## Fraction of samples with any zero in represented: 0.7223199
additional genomic data comparing the tumours to the organoids in terms of ploidy, number of rearrangements and any other things that you think could be relevant
pcawg_CN_features = readRDS("data/pcawg_CN_features.rds")
tcga_CN_features = readRDS("data/tcga_CN_features.rds")
BriTROC_absolute_copynumber = readRDS("data/BriTROC_absolute_copynumber.rds")
BriTROC2_CN_features = readRDS("data/6_TCGA_Signatures_on_BRITROC/0_BRITROC_absolute_CN.rds")
organoids_absolute_copynumber = readRDS("data/organoid_absolute_CN.rds")
sampleNames(organoids_absolute_copynumber) = names_orgs$`new name`[match(gsub("org", "", sampleNames(organoids_absolute_copynumber)), names_orgs$`old name`)]
organoids_CN_features = extractCopynumberFeatures(organoids_absolute_copynumber)
BriTROC_CN_features = readRDS("data/BriTROC_CN_features.rds")
The number of segments can be taken eitehr from segsize (first column) or from copynumber (last column). This is just for PCAWG and TCGA! Not for BriTROC. Any idea why this is the case?
** Note I am plotting this as the log!**
## quartz_off_screen
## 2
TL;DR with a negative binomial model, which is much more appropriate in this setting than a Poisson, there is no difference in the distributions of organoids and non-organodis when it comes to number of segments.
## Analysis of Deviance Table
##
## Model 1: length ~ 1
## Model 2: length ~ bool
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 824 58116
## 2 823 58024 1 91.318 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Warning in anova.negbin(reduced_nb, full_nb, test = "LRT"): only Chi-squared LR
## tests are implemented
## Likelihood ratio tests of Negative Binomial Models
##
## Response: length
## Model theta Resid. df 2 x log-lik. Test df LR stat. Pr(Chi)
## 1 1 3.076226 824 -9971.807
## 2 bool 3.081258 823 -9970.344 1 vs 2 1 1.46259 0.2265185
Unfortunately the scaling factor has to do with the width of the bins in the histogram.
To get the ploidy, I just have to compute the weighted average of the copy number segments (this is computed from the absolute copy number profiles objects, since they specify, for each segment, its length and its ploidy).
Use getSegTable to get the segments from this Biobase file
## we only want the ovarian ones
ICGC_absolute_copynumber_AU = readRDS("data/CN_Calls_ABSOLUTE_PCAWG/OV-AU.segments.raw.rds")
ICGC_absolute_copynumber_US = readRDS("data/CN_Calls_ABSOLUTE_PCAWG/OV-US.segments.raw.rds")
ICGC_absolute_copynumber_AU = ICGC_absolute_copynumber_AU[,c('sample', 'chr', 'startpos', 'endpos', 'segVal')]
ICGC_absolute_copynumber_US = ICGC_absolute_copynumber_US[,c('sample', 'chr', 'startpos', 'endpos', 'segVal')]
segtables_ICGC_absolute_copynumber_AU = lapply(sort(unique(ICGC_absolute_copynumber_AU$sample)),
function(samplename)
ICGC_absolute_copynumber_AU[ICGC_absolute_copynumber_AU$sample == samplename,])
segtables_ICGC_absolute_copynumber_AU = lapply(segtables_ICGC_absolute_copynumber_AU, function(i) { colnames(i)[colnames(i) == "chr"] = "chromosome";
colnames(i)[colnames(i) == "endpos"] = "end";
return(i) } )
names(segtables_ICGC_absolute_copynumber_AU) = unique(ICGC_absolute_copynumber_AU$sample)
segtables_ICGC_absolute_copynumber_US = lapply(sort(unique(ICGC_absolute_copynumber_US$sample)),
function(samplename) ICGC_absolute_copynumber_US[ICGC_absolute_copynumber_US$sample == samplename,])
segtables_ICGC_absolute_copynumber_US = lapply(segtables_ICGC_absolute_copynumber_US, function(i) { colnames(i)[colnames(i) == "chr"] = "chromosome";
colnames(i)[colnames(i) == "endpos"] = "end";
return(i) } )
names(segtables_ICGC_absolute_copynumber_US) = unique(ICGC_absolute_copynumber_US$sample)
## for ICGC, remove the samples row and put it in the rows
segtables_ICGC_absolute_copynumber_US = lapply(segtables_ICGC_absolute_copynumber_US, function(i){
rownames(i) = i$samples
i = i[,-1]
i})
segtables_ICGC_absolute_copynumber_AU = lapply(segtables_ICGC_absolute_copynumber_AU, function(i){
rownames(i) = i$samples
i = i[,-1]
i})
## Check that there are no sex chromosomes included anywhere
## [1] TRUE TRUE TRUE TRUE TRUE
Ploidy is not normally distributed and it’s right-skewed. Moreover, the distribution is bimodal: I guess there are genomes in which there is a clear amplification and genomes which are more or less normal, so centered around 2.
I am also using a robust linear regression, but I don’t think this is suitable either.
t.test(log(ploidy_organoids), log(ploidy_BriTROC))
##
## Welch Two Sample t-test
##
## data: log(ploidy_organoids) and log(ploidy_BriTROC)
## t = -0.90962, df = 20.948, p-value = 0.3734
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.16268846 0.06368736
## sample estimates:
## mean of x mean of y
## 0.9401143 0.9896149
MASS::rlm(ploidy~group,
data=cbind.data.frame(ploidy=c(ploidy_organoids, ploidy_BriTROC), group=c(rep(1,length(ploidy_organoids)), rep(2, length(ploidy_BriTROC)))))
## Call:
## rlm(formula = ploidy ~ group, data = cbind.data.frame(ploidy = c(ploidy_organoids,
## ploidy_BriTROC), group = c(rep(1, length(ploidy_organoids)),
## rep(2, length(ploidy_BriTROC)))))
## Converged in 5 iterations
##
## Coefficients:
## (Intercept) group
## 2.4791559 0.1280964
##
## Degrees of freedom: 298 total; 296 residual
## Scale estimate: 0.675
## Segments across the genome
# (sapply(chrlen$V1, function(i) gsub("chr", "", i)))
sorted_chroms = chrlen$V1[order(as.numeric((gsub("chr", "", chrlen$V1))))]
## Warning in eval(quote(list(...)), env): NAs introduced by coercion
chrom_lenths = chrlen[match(sorted_chroms, chrlen$V1),]
## Warning: Removed 807 rows containing missing values (geom_label_repel).
## Warning: Removed 921 rows containing missing values (geom_label_repel).
## Warning: Removed 807 rows containing missing values (geom_label_repel).
## quartz_off_screen
## 2
## Warning: Removed 921 rows containing missing values (geom_label_repel).
## quartz_off_screen
## 2